perm filename CHAP6[4,KMC]4 blob
sn#036268 filedate 1973-04-17 generic text, type T, neo UTF8
00010 CHAPTER SIX
00100 MODEL VALIDATION
00200 (In collaboration with Franklin Dennis Hilf)
00300
00500
00600
00700 There are several meanings to the term "validate" which
00800 derive from the Latin VALIDUS = strong. Thus to validate X means to
00900 strengthen it. In science it usually means to strengthen X's
01000 acceptability as a hypothesis, theory, or model. Lurking in the
01100 background there is usually some concept of truth or authenticity.
01300 In a purely instrumentalist view theories are simply
01400 calculating or predicting devices for human convenience. They do not
01500 explain and it is unjustified to apply the terms of truth or falsity
01600 to them. Under a realist view one seeks explanatory truth,
01700 that which really is the case, and hence proposed theories must be
01800 evaluated for their authenticity. Since absolute truth cannot be attained
01900 we must settle for degrees of approximations.
02000 To validate, then, is to carry out procedures
02100 which show to what degree X, or its consequences, correspond with
02200 facts of observation. We compare the model with its natural counterpart.
02210 The failures should be constructive, yielding new information. Discrepancies
02220 in the comparison reveal what is not understood and must be modified in the model. After modifications
02230 are made, a fresh comparison is made with the natural counterpart and
02240 we repeatedly cycle through this procedure attempting to gain convergence.
02400
02500 Once a simulation model reaches a stage of intuitive
02600 adequacy, a model builder should consider using more stringent
02700 evaluation procedures relevant to the model's purposes. For example,
02800 if the model is to serve as a training device, then a simple
02900 evaluation of its pedagogic effectiveness would be sufficient. But
03000 when the model is proposed as an explanation of a psychological
03100 process, more is demanded of the evaluation procedure. In the area of
03200 simulation models Turing's test has often been suggested as a validation procedure.
03300 It is very easy to become confused about Turing's Test. In
03400 part this is due to Turing himself who introduced the now-famous
03500 imitation game in a paper entitled COMPUTING MACHINERY AND
03600 INTELLIGENCE (Turing, 1950). A careful reading of this paper reveals
03700 there are actually two imitation games, the second of which is
03800 commonly called Turing's test.
03900 In the first imitation game two groups of judges try to
04000 determine which of two interviewees is a woman. Communication between
04100 judge and interviewee is by teletype. Each judge is initially
04200 informed that one of the interviewees is a woman and one a man who
04300 will pretend to be a woman. After the interview, the judge is asked
04400 what we shall call the woman-question, i.e., which interviewee was the
04500 woman? Turing does not say what else the judge is told but one
04600 assumes the judge is NOT told that a computer is involved nor is he
04700 asked to determine which interviewee is human and which is the
04800 computer. Thus, the first group of judges would interview two
04900 interviewees: a woman, and a man pretending to be a woman.
05000 The second group of judges would be given the same initial
05100 instructions, but unbeknownst to them, the two interviewees would be
05200 a woman and a computer programmed to imitate a woman. Both groups
05300 of judges play this game until sufficient statistical data are
05400 collected to show how often the right identification is made. The
05500 crucial question then is: do the judges decide wrongly AS OFTEN when
05600 the game is played with man and woman as when it is played with a
05700 computer substituted for the man? If so, then the program is
05800 considered to have succeeded in imitating a woman as well as a man
05900 imitating a woman. For emphasis we repeat: in asking the
06000 woman-question in this game, judges are not required to identify
06100 which interviewee is human and which is machine.
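The comparison of error rates between the two groups of judges can be sketched numerically. Turing does not specify a statistical procedure, so the pooled two-proportion z statistic used here is an assumption made for illustration, and the counts are hypothetical, not data from any experiment.

```python
import math

def two_proportion_z(wrong_a, n_a, wrong_b, n_b):
    """Pooled two-proportion z statistic for the difference in error rates.

    wrong_a / n_a: wrong identifications in the man-and-woman game.
    wrong_b / n_b: wrong identifications in the computer-and-woman game.
    """
    p_a = wrong_a / n_a
    p_b = wrong_b / n_b
    pooled = (wrong_a + wrong_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Hypothetical counts, invented for illustration only.
z = two_proportion_z(wrong_a=18, n_a=50, wrong_b=20, n_b=50)
# |z| below 1.96 means the two error rates are not distinguishable
# at the conventional 5% level under this approximation.
print(f"z = {z:.2f}, distinguishable at 5%: {abs(z) >= 1.96}")
```

Under this reading the program "succeeds" when the statistic stays small, i.e. when the judges err about as often against the computer as against the man.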
06200 Later on in his paper Turing proposes a variation of the
06300 first game. In the second game, one interviewee is a man and one is a
06400 computer. The judge is asked to determine which is man and which is
06500 machine, which we shall call the machine-question. It is this version
06600 of the game which is commonly thought of as Turing's test. It has
06700 often been suggested as a means of validating computer simulations of
06800 psychological processes.
06900 In the course of testing a simulation (PARRY) of paranoid
07000 linguistic behavior in a psychiatric interview, we conducted a number
07100 of Turing-like indistinguishability tests (Colby, Hilf, Weber and
07200 Kraemer, 1972). We say `Turing-like' because none of them consisted of
07300 playing the two games described above. We chose not to play these
07400 games for a number of reasons which can be summarized by saying that
07500 they do not meet modern criteria for good experimental design. In
07600 designing our tests we were primarily interested in learning more
07700 about developing the model. We did not believe the simple
07800 machine-question to be a useful one in serving the purpose of
07900 progressively increasing the credibility of the model but we
08000 investigated a variation of it to satisfy the curiosity of colleagues
08100 in artificial intelligence.
08200 In this design eight psychiatrists interviewed two patients by
08300 teletype, using the technique of machine-mediated interviewing,
08400 which involves what we term "non-nonverbal" communication since
08500 non-verbal cues are made impossible (Hilf, 1972). Each judge
08600 interviewed two patients, one being PARRY and one being a hospitalized
08700 paranoid patient. The interviewers were not informed that a
08800 simulation was involved nor were they asked to identify which was the
08900 machine. Their task was to conduct a diagnostic psychiatric interview
09000 and rate each response from the `patients' along a 0-9 scale of
09100 paranoidness, 0 meaning zero and 9 being highest. Transcripts of
09200 these interviews, without the ratings of the interviewers, were then
09300 utilized for various experiments in which randomly selected expert
09400 judges conducted evaluations of the interview transcripts. For
09500 example, in one experiment it was found that patients and model were
09600 indistinguishable along the dimension of paranoidness.
09610 (Elaborate from ttt paper here giving interviews, data, tables etc.)
09700 To ask the machine-question, we sent interview transcripts,
09800 one with a patient and one with PARRY, to 100 psychiatrists randomly
09900 selected from the Directory of American Specialists and the Directory
10000 of the American Psychiatric Association. Of the 41 replies 21 (51%)
10100 made the correct identification while 20 (49%) were wrong. Based on
10200 this random sample of 41 psychiatrists, the 95% confidence interval
10300 is between 35.9 and 66.5, a range which is close to chance. (Our
10400 statistical consultant was Dr. Helena C. Kraemer, Research
10500 Associate in Biostatistics, Department of Psychiatry, Stanford
10600 University.)
10700 Psychiatrists are considered expert judges of patient
10800 interview behavior but they are unfamiliar with computers. Hence we
10900 conducted the same test with 100 computer scientists randomly
11000 selected from the membership list of the Association for Computing
11100 Machinery, ACM. Of the 67 replies 32 (48%) were right and 35 (52%)
11200 were wrong. Based on this random sample of 67 computer scientists the
11300 95% confidence interval ranges from 36 to 60, again close to a chance level.
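The two confidence intervals above can be checked with a short calculation. The text does not say which method was used; the normal-approximation (Wald) interval below is an assumption, though it appears to agree with the reported figures for these two samples. The function name is ours.

```python
import math

def wald_interval(correct, n, z=1.96):
    """Normal-approximation (Wald) interval for a proportion, in percent.

    Assumed method: the source does not name the procedure it used.
    """
    p = correct / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return 100 * (p - half_width), 100 * (p + half_width)

# Counts reported in the text: correct identifications out of replies received.
samples = [("psychiatrists", 21, 41), ("computer scientists", 32, 67)]
for label, correct, replies in samples:
    low, high = wald_interval(correct, replies)
    print(f"{label}: {correct}/{replies} correct, "
          f"95% interval {low:.1f} to {high:.1f}")
```

Both intervals contain 50, which is what is meant by saying the judges performed close to a chance level.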
11400 Thus the answer to this machine-question "can expert judges,
11500 psychiatrists and computer scientists, using teletyped transcripts
11600 of psychiatric interviews, distinguish between paranoid patients and
11700 a simulation of paranoid processes?" is "No". But what do we learn
11800 from this? It is some comfort that the answer was not "yes" and the
11900 null hypothesis (no differences) failed to be rejected, especially
12000 since statistical tests are somewhat biased in favor of rejecting the
12100 null hypothesis (Meehl, 1967). Yet this answer does not tell us what
12200 we would most like to know, i.e. how to improve the model.
12300 Simulation models do not spring forth in a complete, perfect and
12400 final form; they must be gradually developed over time. Perhaps we
12500 might obtain a "yes" answer to the machine-question if we allowed a
12600 large number of expert judges to conduct the interviews themselves
12700 rather than studying transcripts of other interviewers. A "yes" answer would
12800 indicate that the model must be improved, but unless we systematically
12900 investigated how the judges succeeded in making the discrimination we
13000 would not know what aspects of the model to work on. The logistics of
13100 such a design are immense and obtaining a large N of judges for sound
13200 statistical inference would require an effort disproportionate to the
13300 information-yield.
13400 A more efficient and informative way to use Turing-like tests
13500 is to ask judges to make ordinal ratings along scaled dimensions from
13600 teletyped interviews. We shall term this approach asking the
13700 dimension-question. One can then compare scaled ratings received by
13800 the patients and by the model to precisely determine where and by how
13900 much they differ. Model builders strive for a model which
14000 shows indistinguishability along some dimensions and
14100 distinguishability along others. That is, the model converges on what
14200 it is supposed to simulate and diverges from that which it is not.
14300 We mailed paired-interview transcripts to another 400
14400 randomly selected psychiatrists asking them to rate the responses of
14500 the two `patients' along certain dimensions. The judges were divided
14600 into groups, each judge being asked to rate responses of each I-O
14700 pair in the interviews along four dimensions. The total number of
14800 dimensions in this test was twelve: linguistic noncomprehension,
14900 thought disorder, organic brain syndrome, bizarreness, anger, fear,
15000 ideas of reference, delusions, mistrust, depression, suspiciousness
15100 and mania. These are dimensions which psychiatrists commonly use in
15200 evaluating patients.
15300 Table 1 shows there were significant differences, with PARRY
15400 receiving higher scores along the dimensions of linguistic
15500 noncomprehension, thought disorder, bizarreness, anger, mistrust and
15600 suspiciousness. On the dimension of delusions the patients were rated
15700 significantly higher. There were no significant differences along the
15800 dimensions of organic brain syndrome, fear, ideas of reference,
15900 depression and mania.
16000 While tests asking the machine-question indicate
16100 indistinguishability at the gross level, a study of the finer
16200 structure of the model's behavior through ratings along scaled
16300 dimensions shows statistically significant differences between
16400 patients and model. These differences are of help to the model
16500 builder in suggesting which aspects of the model must be modified and
16600 improved in order to be considered an adequate simulation of the
16700 class of paranoid patients it is intended to simulate. For example,
16800 it is clear that PARRY'S language-comprehension must be improved.
16900 Once this has been implemented, a future test will tell us whether
17000 improvement has occurred and by how much in comparison to the earlier
17100 version. Successive identification of particular areas of failure
17200 in the model permits their improvement and the development of
17300 more adequate model-versions.
17400 Further evidence that the machine-question is too coarse and
17500 insensitive a test comes from the following experiment. In this test
17600 we constructed a random version of the paranoid model which utilized
17700 PARRY'S output statements but expressed them randomly no matter what
17800 the interviewer said. Two psychiatrists conducted interviews with
17900 this model, transcripts of which were paired with patient interviews
18000 and sent to 200 randomly selected psychiatrists asking both the
18100 machine-question and the dimension-question. Of the 69 replies, 34
18200 (49%) were right and 35 (51%) wrong. Based on this random sample of
18300 69 psychiatrists, the 95% confidence interval ranges from 39 to 63,
18400 again indicating a chance level. However, as shown in Table 2,
18500 significant differences appear along the dimensions of linguistic
18600 noncomprehension, thought disorder and bizarreness, with RANDOM-PARRY
18700 rated higher. On these particular dimensions we can construct a
18800 continuum in which the random version represents one extreme, the
18900 actual patients another. Our (nonrandom) PARRY lies somewhere between
19000 these two extremes, indicating that it performs significantly better
19100 than the random version but still requires improvement before being
19200 indistinguishable from patients (see Fig. 1). Table 3 presents t
19300 values for differences between mean ratings of PARRY and
19400 RANDOM-PARRY (see Table 2 and Fig. 1 for the mean ratings).
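A t value for the difference between two mean ratings, of the kind Table 3 reports, can be computed as in the following minimal sketch. The text does not state which form of the t test was used; the unequal-variance (Welch) form is assumed here, and the ratings are hypothetical, invented for illustration only.

```python
import math
from statistics import mean, variance

def welch_t(ratings_a, ratings_b):
    """Welch's t statistic for the difference between two mean ratings.

    Assumed form: the source does not say which t test it used.
    """
    var_a, var_b = variance(ratings_a), variance(ratings_b)
    se = math.sqrt(var_a / len(ratings_a) + var_b / len(ratings_b))
    return (mean(ratings_a) - mean(ratings_b)) / se

# Hypothetical ratings on one dimension (e.g. linguistic noncomprehension).
random_parry = [7, 8, 6, 9, 7, 8]
parry = [3, 4, 2, 5, 4, 3]

t = welch_t(random_parry, parry)
print(f"t = {t:.2f}")  # positive when RANDOM-PARRY is rated higher
```

Repeating the calculation for each dimension gives one t value per dimension, which is what allows the model builder to see where the versions differ and by how much.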
19500 Thus it can be seen that such a multidimensional analysis
19600 provides yardsticks for measuring the adequacy of this or any other
19700 dialogue simulation model along the relevant dimensions.
19800 We conclude that when model builders want to conduct tests
19900 of adequacy which indicate in which direction progress lies and which provide a
20000 measure of whether progress is being achieved, the way to use
20100 Turing-like tests is to ask expert judges to make ratings along
20200 multiple dimensions that are essential to the model. A good validation
20210 procedure has criteria for better or worse approximations. Useful tests do
20300 not prove a model, they probe it for its strengths and weaknesses and
20310 clarify what is to be done next in modifying and repairing the model.
20400 Simply asking the machine-question yields little information relevant
20500 to what the model builder most wants to know, namely, along what
20600 dimensions must the model be improved.
20700
20800